Pangenome graph layout by Path-Guided Stochastic Gradient Descent

Abstract Motivation The increasing availability of complete genomes demands for models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in low (e.g. two) dimensional depictions. Due to a pangenome graph’s potential excessive size, this is a significant challenge. Results In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by SGD. We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. Availability and implementation We integrated PG-SGD in ODGI which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.


Introduction
Reference genomes are widely used in genomics, serving as a foundation for a variety of analyses, including gene annotation, read mapping, and variant detection (Singh et al. 2022).However, this linear model is becoming obsolete given the accessibility to hundreds or even thousands of high-quality genomes.A single genome cannot fully represent the genetic diversity of any species, resulting in reference bias (Ballouz et al. 2019).In contrast, a pangenome models the entire set of genomic elements of a given population (Tettelin et al. 2008, Computational Pan-Genomics Consortium 2018, Eizenga et al. 2020, Sherman and Salzberg 2020).Pangenomes can be represented as a sequence graph incorporating sequences as nodes and their relationships as edges (Hein 1989).In the variation graph model (Garrison et al. 2018), genomes are encoded as paths traversing the nodes in the graph.
A graph layout is the arrangement of nodes and edges in an N-dimensional space.Graph layout algorithms aim to find optimal node coordinates in order to minimize overlapping nodes or edges, reduce edge crossings, and promote an intuitive understanding of the graph.One popular approach is force-directed graph drawing (Cheong and Si 2022) which uses physical simulation to produce esthetic layouts.The classical approach combines repulsive forces on all vertices and attractive forces on adjacent vertices.This is prone to get stuck in local minima, but multi-layer strategies such as the Fast Multipole Multilevel Method (FM 3 ) (Hachul and J€ unger 2005) or Stochastic Gradient Descent (SGD) implementations alleviate such a problem (Zheng et al. 2019).SGD uses the gradient of its individual terms to approximate the gradient of a sum of functions.
A pangenome graph layout can provide a human-readable visualization of genetic variation between multiple genomes.However, Zheng et al. (2019)'s algorithm has a quadratic up front cost in the number of nodes to find pairwise distances to guide the layout, making it impossible to apply to pangenome graphs with millions of nodes.Also, existing generic graph layout approaches ignore the biological information inherent in pangenome graphs.One such bioinformatics tool is BandageNG, the current state of the art for genome graph visualization.It uses FM 3 which only considers the nodes and edges of a graph.
In practice, MultiDimensional Scaling (MDS) is applied to minimize the difference between the visual distance and theoretical graph distance.This can be accomplished by using pairwise node distances to minimize an energy function.Since pangenome graphs represent genomes as paths in the graph, a reasonable distance metric would be the nucleotide distance between a pair of nodes traversed by the same path.Such path sampling would overcome the quadratic costs of previous versions of graph drawing by SGD.
Typically, force-directed layouts are hard to compute (Wang et al. 2014).Although, BandageNG applies FM 3 for layout generation, its parallelism is bound by the number of connected graph components.Alternatively, the lock-free HOGWILD! method offers a highly parallelizable and thus scalable SGD approach that can be applied when the optimization problem is sparse (Recht et al. 2011).
Here, we present a new pangenome graph layout algorithm which applies a Path-Guided SGD (PG-SGD) to use the paths as an embedded positional system to find distances between nodes, moving pairs of nodes in parallel with a modified HOGWILD! strategy.The algorithm computes the pangenome graph layout that best reflects the nucleotide sequences in the graph.To our knowledge, no generic graph layout algorithm takes into account such path encoded biological information when computing a graph's layout.
PG-SGD can be extended in any number of dimensions.In the ODGI toolkit (Guarracino et al. 2022), we provide implementations for 1D and 2D layouts.These algorithms have already been successfully applied to construct and visualize large-scale pangenome graphs of the Human Pangenome Reference Consortium (HPRC) (Guarracino et al. 2023, Liao et al. 2023).In addition, we show that PG-SGD is almost an order of magnitude faster than BandageNG.

Algorithm
While PG-SGD is inspired by Zheng et al. (2019), we designed the algorithm to work on the variation graph model (Definition 2.1).
Definition 2.1.Variation graphs are a mathematical formalism to represent pangenome graphs (Garrison 2019).In the variation graph G ¼ ðV; E; PÞ, nodes (or vertices) V ¼ v 1 . . .v jVj contain nucleotide sequences.Each node v i has a unique identifier i and an implicit reverse complement � v i .The node strand o represents the node orientation.Edges E ¼ e 1 . . .e jEj connect ordered pairs of node strands (e i ¼ ðo a ; o b Þ), defining the graph topology.Paths P ¼ p 1 . . .p jPj are series of connected steps s i that refer to node strands in the graph (p i ¼ s 1 . . .s jpij ); the paths represent the genomes embedded in the graph.
We report PG-SGD's pseudocode in Algorithm 1 and its schematic in Fig. 1.In brief, the algorithm moves one pair of nodes ðv i ; v j Þ at a time, minimizing the difference between the layout distance ld ij of the two nodes and the nucleotide distance nd ij of the same nodes as calculated along a path that traverses them.In the 2D layouts, nodes have two ends.When moving a pair of nodes, we actually move one end of each node.For clarification, an example is given in Fig. 1. v i is the node associated with the step s i sampled uniformly from all the steps in P. v j is the node associated with the step s j sampled from the same path of s i by drawing a uniform or a Zipfian distribution (Zipf 1932).The difference between nd ij and ld ij guides the update of the node coordinates in the layout.The magnitude r of the update depends on the learning rate μ.The number of iterations steers the annealing step size η which determines the learning rate μ.A large η in the first iterations leads to a globally linear (in 1D) or planar (in 2D) layout.By decreasing η, the layout adjustments become more localized, ensuring that the nodes are positioned to best reflect the nucleotide distances in the paths (i.e. in the genomes).
Originating from empirical inspection of word frequency tables, Zipf's law states that a word with rank n occurs 1=n times as the most frequent one.This law is modeled by the Zipf distribution.Sampling s j from a Zipf distribution fixed in the s i' s path position space increases the possibility to draw a nucleotide position close to s i .So there is a high chance to use small nucleotide distances nd ij to refine the layout of nodes comprising a few base pairs.The Zipf distribution is also long-tailed, with many occurrences of low frequency events.However, extremely long-range correlations might not be captured sufficiently, resulting in collapsed layouts for structures that are otherwise linear.To provide balance between global and local layout updates, in half of the updates (flip flag in Algorithm 1), the s j is sampled uniformly instead from a Zipf distribution, with uniform sampling being more favorable for global updates.Furthermore, to enhance local linearity (in 1D) or planarity (in 2D) of the graph layout, a cooling phase skews the Zipfian distribution after half of iterations have been completed.This increases the likelihood of sampling smaller nucleotide distances for the layout updates.

Implementation
We implemented PG-SGD in ODGI (Guarracino et al. 2022): the 1D version can be found in odgi sort and the 2D version in odgi layout.To efficiently retrieve path nucleotide positions, we implemented a path index.This index is a strict subset of the XG index (Garrison et al. 2018) where we avoid to use succinct SDSL data structures (Gog et al. 2014).Instead, we rely on bit-compressed integer vectors, enabling efficient retrieval of path nucleotide positions to quickly compute nucleotide distances without having to store all pairwise distances between nodes in memory.This approach ensures to scale on large pangenome graphs representing thousands of whole genomes.
Graph layout initialization can significantly influence the quality of the final layout.In the 1D implementation, by default, nodes are placed in the same order as they appear in the input graph, although we also provide support for random layout initialization.In 2D, we offer several layout initialization techniques.One approach places nodes in the first layout dimension according to their order in the input graph, adding either uniform or Gaussian noise in the second dimension.Another strategy arranges nodes along a Hilbert curve, an approach that often favors the creation of planar final layouts.We also support fixing node positions to keep nodes in the same order as they are in a selected path, such as a reference genome.This feature allows us to build reference-focused graph layouts (Supplementary Fig. S1d).
Our implementation is multithreaded and uses shared memory for storing the layout in a vector, according to the HOGWILD!strategy (Recht et al. 2011).Threads perform layout updates without any locking for additional speed up.This approach is feasible since pangenome graphs are typically sparse (Guarracino et al. 2022), with low average node degree.As a result, the updates only modify small parts of the entire layout.While the HOGWILD!SGD algorithm writes the layout updates to a shared non-atomic double vector, PG-SGD stores node coordinates in a vector of atomic doubles.This vector prevents any potential memory overwrites.Our tests revealed basically no performance loss with respect to the non-atomic counterpart.

Performance
We apply the 2D PG-SGD to the human pangenome (Liao et al. 2023) from the HPRC to show the scalability of the algorithm.Experiments were conducted on a cluster with 24 Regular nodes (32 cores/64 threads with two AMD EPYC 7343 processors with 512 GB RAM) and 4 HighMem nodes (64 cores/128 threads with two AMD EPYC 7513 processors with 2048 GB RAM).We downloaded pangenome graphs for each autosome (24 in total) and for the mitochondrial DNA.Each graph represents 90 whole human haplotypes: 44 diploid individuals plus the GRCh38 (Schneider et al. 2017) and CHM13 (Nurk et al. 2021) haploid human references (see Supplementary Table S1 for graph statistics).When applied to these pangenome graphs using one Regular node for each calculation, odgi layout's 2D PG-SGD implementation obtains the graph layouts in 50 min on average, with the highest run time observed being chromosome 16 (Supplementary Table S1).This is expected since chromosome 16 has one of the highest levels of segmentally duplicated sequence among the human autosomes Algorithm 1: Pseudocode of PG-SGD in 1D.
Pangenome graph layout by path-guided stochastic gradient descent (Martin et al. 2004).Repetitive sequences lead to graph nodes with a very high number of path steps, which are computationally expensive to work with (Guarracino et al. 2022).Memory consumption is 29.66 GB of RAM on average, with the memory peak again occurring with chromosome 16, due to the path index building phase.Given its scalability, we applied 2D PG-SGD to the full graph with all chromosomes together using a HighMem node (Supplementary Table S1).To compare, BandageNG (https:// github.com/asl/BandageNG,last accessed July 2023), the current state of the art for graph visualization, was used to calculate a 2D layout of each of the HPRC pangenome graphs.For a fair comparison, we did not rely on BandageNG's interactive GUI application, but we executed BandageNG layout, which directly emits a 2D graph layout similar to odgi layout.BandageNG was not able to produce a layout for the full graph within 7 days, hitting the wall clock time limit of the cluster.On average, PG-SGD is �8× faster than BandageNG while using �2× less memory.

Pangenome graph layouts reveal biological features
Graph visualization is essential for understanding pangenome graphs and the genome variation they represent.We show how 2D PG-SGD allows us gaining insight into biological data by looking at the graph layout structure.In Fig. 2a, the chromosomes of the HPRC graph show the large-scale structural variations in the centromeres.Focusing on the major histocompatibility complex (MHC) of chromosome 6 (Fig. 2b), the 2D layout reveals the positions and diversity of all MHC genes (Fig. 2c).In Fig. 2d, the C4A and C4B genes are highlighted.Complementary, we provide various 1D visualizations in Supplementary Fig. S1.

Discussion
We presented PG-SGD, the first layout algorithm for pangenome graphs that leverages the biological information available within the genomes represented in the graph.Other generic graph layout algorithms, such as the one offered by BandageNG, ignore this additional information.Our implementation efficiently computes the layout of pangenome graphs representing thousands of whole genomes.
Graph visualization is key for understanding genome variations and the layouts produced by PG-SGD offer an unprecedented high-level perspective on pangenome variation.We implemented PG-SGD to generate layouts in 1D and 2D.These graph projections have already been employed in constructing and analyzing the first draft human pangenome reference (Liao et al. 2023), as well as in the discovery of heterologous recombination of human acrocentric chromosomes (Guarracino et al. 2023).Furthermore, they are applied in the creation and analysis of pangenome graphs for any species (Guarracino et al. 2022, Garrison et al. 2023).Of note, there still remains a gap in interactive and scalable solutions that merge layouts of large pangenome graphs with annotation.Our algorithm will underpin new pangenome graph browsers for studying graph layouts and the genome variation they represent (https://github.com/chfi/waragraph,last accessed July 2023).
The performance analysis shows that our 2D implementation outperforms BandageNG when handling large, complex pangenome graphs.While BandageNG was not able to deliver a layout of the whole HPRC graph within 1 week, our 2D PG-SGD calculated one within one day.There are some possible optimization approaches for future work to further improve the performance of PG-SGD, making it possible for interactive use.The data structure could be optimized to improve cache performance.Moreover, the highdegree of parallelism could be further exploited by using a GPU.In BandageNG, one cannot select the number of threads for the calculations.They are automatically chosen by the number of connected components of the graph to draw.This limits its parallelism and leads to an unbalanced workload.Since BandageNG was primarily designed for assembly graphs, one may have to adjust its parameters dependent on the input graph, in order to boost the layout generation or to adjust the highlighting of desired graph features.
The classical force model of state of the art generic graph algorithms, such as FM 3 -based ones, places nodes according to their attractive and repulsive forces.This force can be seen as equivalent to how our 2D PG-SGD moves the nodes' ends in 2D: If the nucleotide distance of the randomly chosen path steps is smaller than the layout distance of the nodes' ends, we move them closer together ("attractive force"), else we move them further away ("repulsive force").However, the key difference here is that this approach is path-guided: paths represent biological sequences in pangenome graphs, so it is as if PG-SDG considers a "biological force" for placing the graph nodes.Theoretically, it would be possible to combine our approach with a force-directed one.Combining both methods, we might get the best of both worlds: multithreadable PG-SGD iteratively applied to different graph layout-levels.We can imagine that such an approach can lead to a further speedup when calculating the layout.However, for generic graphs, this would only work if path information for each node could be added: we would replace the classical physical simulation approach with our path-guided method.If such information is not available, one could randomly cover the graph with paths.This function is already provided in odgi cover.However, this is an NP-hard problem and our preliminary solutions proved ineffective.
With assembly graphs we face the same problem: they usually do not carry path information during each assembly step.One could map the initial assembly reads back against the assembly graph in order to build paths through the graph.This would allow us to obtain a layout using PG-SGD.
PG-SGD can be extended to any number of dimensions.It can be seen as a graph embedding algorithm that converts high-dimensional, sparse pangenome graphs into lowdimensional, dense, and continuous vector spaces, while preserving its biologically relevant information.This enables the application of machine learning algorithms that use the graph layout for variant detection and classification.Our future research involves leveraging these graph projections to detect structural variants and to identify and correct assembly errors.Moreover, we are considering extending the algorithm to RNA and protein sequences to support pantranscriptome graphs (Sibbesen et al. 2023) and panproteome graphs (Dabbaghie et al. 2023), respectively.

Figure 1 .
Figure 1.2D PG-SGD update operation sketches.(a) The path information of the graph.path1 and path2 both visit the same first node.Then their sequence diverges and they visit distinct nodes.(b-e) v i /v j or v i /v k is the current pair of nodes to update.ld ij /ld ik is the current layout distance.r; − r is the current size of the update.(b) Initial graph layout highlighting the future update of the two nodes of path1.(c) The graph layout after the first update.The nodes appear longer now, because we updated at the end of the nodes.Highlighted is the future update of the two nodes of path2.(d) The graph layout after the second update.Highlighted is the future update of the two nodes of path1.(e) Final graph layout after three updates using the 2D PG-SGD.

Figure 2 .
Figure 2. 2D visualizations of all chromosomes of the Human Pangenome Reference Consortium (HPRC) 90 haplotypes pangenome graph, chromosome 6, the major histocompatibility complex (MHC), and the complement component 4 (C4).(a) odgi draw layout of the HPRC pangenome graph 90 haplotypes.Displayed are all 24 autosomes and the mitochondrial chromosome.A red rectangle highlights chromosome 6 which is shown in the subfigure below.(b) gfaestus screenshot of the chromosome 6 layout.Colored in blue is the MHC.The hairball in the middle is the centromere.The black structures in the centromere are edges.(c) gfaestus screenshot of the MHC.All MHC genes are color annotated and the names of the genes appear as a text overlay.(d) gfaestus screenshot of the region around C4, specifically color highlighting genes C4A and C4B.The black lines are the edges of the graph.